NAR Genomics and Bioinformatics

Oxford University Press (OUP)

Preprints posted in the last 30 days, ranked by how well they match NAR Genomics and Bioinformatics's content profile, based on 214 papers previously published here. The average preprint has a 0.11% match score for this journal, so anything above that is already an above-average fit.

1
Clustering Strategies Improve Structure-Preserving Visualization of Single-Cell RNA-seq Data with CBMAP

Alchaar, M.; Dogan, B.

2026-05-04 bioinformatics 10.64898/2026.04.30.721861 medRxiv
Top 0.1%
10.1%

Dimensionality reduction for visualization is a fundamental step in single-cell RNA sequencing (scRNA-seq) analysis due to the extremely high dimensionality of gene expression profiles. However, widely used nonlinear embedding techniques such as UMAP and t-SNE can introduce substantial distortions when projecting data into two-dimensional space, potentially altering global organization, local neighborhoods, and distance relationships in ways that may mislead downstream biological interpretation. In this study, we investigate the applicability of Clustering-Based Manifold Approximation and Projection (CBMAP) for the visualization of scRNA-seq data and systematically examine how clustering strategies influence the quality of the resulting embeddings. CBMAP was integrated with several clustering algorithms commonly used in single-cell analysis, including k-means, Leiden, HDBSCAN, Secuer, HGC, and FlowSOM. The resulting embeddings were evaluated using quantitative metrics that measure global, local, and distance-level structure preservation and were compared with widely used dimensionality reduction methods such as UMAP, t-SNE, and PaCMAP across multiple benchmark datasets. Our results demonstrate that the clustering stage plays a critical role in determining the structural fidelity of CBMAP embeddings. Clustering algorithms specifically designed for single-cell transcriptomic data, particularly Secuer, produced more consistent preservation of global relationships between cell populations. Across multiple datasets, CBMAP more faithfully preserved global structural organization and inter-population distance relationships than the compared methods, although local neighborhood preservation was generally weaker than in techniques optimized for local structure. Importantly, CBMAP embeddings retained biologically meaningful relationships in trajectory benchmark datasets. 
When combined with RNA velocity analysis, CBMAP successfully preserved cyclic progenitor states and branching differentiation trajectories, demonstrating compatibility with trajectory-aware visualization. These findings indicate that CBMAP provides a structure-faithful visualization framework for scRNA-seq data and that clustering selection plays a central role in determining embedding quality.

2
On the benchmarking of clustering algorithms and hyperparameter influence for cell type detection in single-cell RNA sequencing data.

Szmigiel, A.; Gesteira Costa Filho, I.; Campello, R. J. G. B.

2026-05-17 bioinformatics 10.1101/2025.08.20.671270 medRxiv
Top 0.1%
8.4%

Clustering single-cell RNA-seq (scRNA-seq) data and related protocols remains a major challenge due to high dimensionality, sparsity, and noise. Despite numerous benchmarking studies aiming to identify the most suitable clustering methods, many suffer from methodological flaws that can undermine their conclusions. A major challenge in benchmarking is selecting representative datasets that cover the diversity of scRNA-seq experiments and include laboratory-verified labels for reliable evaluation. Consistent preprocessing of all inputs to benchmarked algorithms is crucial, as it significantly impacts performance. Beyond selecting an algorithm, a thorough exploration of hyperparameters is also essential to assess robustness and identify configurations that maximize performance. We propose an improved benchmarking framework that addresses common methodological issues in prior studies. We illustrate our proposed methodology in a case study comparing the classic Leiden and Louvain clustering algorithms with extensive hyperparameter exploration on a carefully curated collection of real gold standard datasets. By evaluating clustering performance across different hyperparameter selection scenarios, we show that benchmarking results can be misleading, either overestimating or underestimating performance depending on how the hyperparameter space is explored. In our illustrative case study, benchmarking results do not reveal any practically relevant performance differences between the Louvain and Leiden algorithms. In contrast, we show that overlooked factors such as graph construction and quality functions critically influence clustering outcomes, particularly under suboptimal settings of numerical hyperparameters, namely the neighborhood size k used for similarity graph construction and the resolution hyperparameter in graph-based clustering algorithms. 
While noticeable trends have been observed in terms of how different (dis)similarity functions affect performance, the impact of this choice is limited and, to some extent, overridden by the graph-building approach. Across different graphs, there is a noticeable trade-off between achieving optimal performance with ideally tuned numerical hyperparameters and maintaining robustness under more realistic, unsupervised, and suboptimal settings. All in all, the analysis of our illustrative benchmarking case study offers clear guidance and objective recommendations for practitioners in the field. Most importantly, as the main contribution of this manuscript, our proposed framework sets a foundation for more reliable scRNA-seq clustering evaluation and benchmarking in future studies.

3
Building computational benchmarks: an Omnibenchmark reimplementation of a single-cell preprocessing pipeline evaluation

Choudhury, A.; Kitak, T.; Carrillo, B.; Busch, P.; Emons, M.; Gunz, S.; Koderman, M.; Luo, S.; Mallona, I.; Meara, A.; Wissel, D.; Robinson, M. D.

2026-05-05 bioinformatics 10.64898/2026.05.01.722166 medRxiv
Top 0.1%
8.2%

In the past few years, we have seen a veritable surge in single-cell (e.g., RNA sequencing) techniques and datasets, enabling increasingly detailed characterization of cellular heterogeneity across tissues and conditions. This surge in single-cell techniques has been complemented by a large number of analysis frameworks and pipelines, and a large parameter space and researcher degrees of freedom to use them. Many neutral benchmarks have been presented for various computational tasks, but most make design decisions that render them incompatible with each other, e.g., different datasets and metrics, or parameter sets used. In this work, we showcase a recently developed framework, Omnibenchmark, to build reproducible, extensible and standardized method comparisons. This not only facilitates the broad investigation of pipelines used in single-cell data analysis, but also highlights how the process of building benchmarks can be streamlined and unified. We do this as an initial proof-of-principle for an arms-length benchmark that evaluates five single-cell RNA sequencing pipelines (filtering to normalization to dimensionality reduction to clustering) on three datasets. This standardization enables benchmarks to be easily extended in several directions, including broader parameter sweeps, comparisons across software versions and architectures, isolation of pipeline steps, and integration of additional pipelines, datasets, and metrics.

4
Integrating trajectory inference and gene regulatory network analysis to resolve transcriptional programs of T cell state transitions in the tumor microenvironment

Casals-Franch, R.; Nonell, L.; Villa-Freixa, J.; Lopez Garcia de Lomana, A.

2026-05-14 bioinformatics 10.64898/2026.05.12.724558 medRxiv
Top 0.1%
8.1%

Reconstructing dynamic immune cell state transitions from single-cell transcriptomic data requires coordinated analytical strategies that capture both phenotypic progression and underlying regulatory programs. This protocol describes a step-by-step computational workflow for analyzing human tumor-infiltrating T cells using the sequential application of dimensionality reduction, pseudotime trajectory inference, regulon activity analysis, and transcription factor-transcription factor network reconstruction. The workflow outlines data preprocessing and quality control, trajectory rooting and parameter selection, branch-specific differential analysis, and the integration of regulon inference to contextualize transcriptional programs along inferred trajectories. Regulon-based TF-TF network reconstruction is used as a downstream interpretive layer to identify regulatory modules associated with distinct cell-state transitions. Publicly available at GitHub repository https://github.com/rogercasalsfr/immuno-trajectory-grn-integrative-workflow, this protocol emphasizes practical considerations including parameter sensitivity, trajectory robustness, and consistency between phenotypic and regulatory outputs. The protocol supports reproducible analysis and interpretation of immune cell dynamics in human tumor microenvironment studies using single-cell RNA sequencing data.

5
Quantifying Cross-Modal Association Confidence for Single-Cell RNA-ATAC Integration

Furutani, T.; Ji, H.

2026-05-12 bioinformatics 10.64898/2026.05.07.723400 medRxiv
Top 0.1%
7.2%

While multimodal sequencing technologies are rapidly advancing, most single-cell and spatial datasets still measure only a single modality. Integrative computational methods for separately profiled single-cell RNA-seq (scRNA-seq) and ATAC-seq (scATAC-seq) data typically rely on the assumption that gene expression correlates with the chromatin accessibility of nearby regulatory regions. However, the strength and reliability of these correlations vary substantially across genes, and incorporating low-confidence associations can compromise integration accuracy. Here, we introduce the CLIC (Cross-modality Link Confidence) score, a quantitative measure of the empirical concordance between gene expression and nearby chromatin accessibility, derived from diverse single-cell multiome datasets from the ENCODE project. CLIC scores provide prior confidence estimates for gene-peak associations across modalities. Building on this, we propose a hybrid feature selection strategy that intersects highly variable genes with high-CLIC genes, generating feature sets that better align with the assumptions of cross-modal integration methods. Across diverse publicly available single-cell and spatial datasets, and multiple state-of-the-art integration frameworks, our approach consistently improves the integration of gene expression and chromatin accessibility data, enhancing both robustness and biological interpretability. [Graphical Abstract: Figure 1]

6
Robust data-driven gene expression inference for RNA-seq using curated intergenic regions

Brandulas Cammarata, A.; Fonseca Costa, S. S.; Rosikiewicz, M.; Roux, J.; Wollbrett, J.; Bastian, F. B.; Robinson-Rechavi, M.

2026-05-20 genomics 10.1101/2022.03.31.486555 medRxiv
Top 0.2%
6.7%

RNA-Seq is a powerful technique to provide quantitative information on gene expression. While many applications focus on measuring expression levels, accurately distinguishing between actively and inactively transcribed genes is equally important for understanding gene function, development, and disease mechanisms. However, setting a biologically meaningful threshold for calling genes expressed is challenging due to variability in noise levels across different protocols, experiments or biological samples. We propose to define this threshold per sample relative to the background level observed in inactive genomic features, inferred by the amount of reads mapped to intergenic regions of the genome, and to call genes expressed if their level of expression is significantly higher than the estimated background noise. This approach can be applied to a single RNA-Seq library as well as to a combination of libraries from the same condition, in model and non-model organisms. We show that our method yields a more accurate prediction of expression state than existing methods, illustrated by consistent expression calls for biological replicates in the same tissue.

7
geneML: Gene annotation across diverse fungal species using deep learning

Vader, L.; Harvey, C. J.; Weber, T.; Hon, L. S.

2026-05-21 bioinformatics 10.64898/2026.05.18.725946 medRxiv
Top 0.2%
6.6%

Accurate gene prediction remains a major bottleneck in fungal genomics, where lineage diversity and alternative splicing challenge existing ab initio methods. Here, we present geneML, a deep learning-based gene prediction tool tailored to fungal genomes. Across nine reference genomes spanning diverse fungal taxa, geneML improved gene-level F1 score from 64.9 to 67.1 compared to BRAKER3 with protein-based hints, driven by substantially higher recall (69.0 vs. 64.1) at equivalent precision. geneML also remains fast, averaging around 6 minutes per genome on a standard 8-core CPU. A key feature of geneML is its ability to predict alternative transcripts. Evaluated against Fusarium graminearum Iso-Seq control data, it achieves 41.1% transcript recall and 71.1% precision, outperforming AUGUSTUS (33.8% recall, 48.9% precision), one of the few tools that support isoform prediction. The predicted transcript diversity is consistent with experimentally observed fungal alternative splicing patterns. Reannotation of the curated training dataset further suggests improved biological completeness, with geneML recovering 15.3% more genes containing complete PFAM domains than the reference annotation. These results demonstrate that geneML enables faster, more sensitive, and more biologically informative fungal genome annotation. geneML is available as an open-source command-line tool at https://github.com/hexagonbio/geneML.

Key Points:
- geneML improves gene prediction accuracy over both classical and recent deep learning-based methods, while substantially improving recall.
- geneML predicts alternative transcripts with higher precision and recall than AUGUSTUS, expanding functional annotation.
- Runtime was reduced 32-fold relative to BRAKER3, enabling efficient high-throughput genome annotation.
- geneML identifies novel genes and recovers missing annotations, especially in under-annotated non-Ascomycete genomes.

8
An assessment of normalization and differential expression methods for miRNA-seq analysis using a realistic benchmark dataset

Aparicio-Puerta, E.; Baran, A. M.; Ashton, J. M.; Pritchett, E. M.; Gaca, A.; Becker, J.; Halushka, M. K.; Jun, S.-H.; McCall, M. N.

2026-05-13 bioinformatics 10.64898/2026.05.08.723923 medRxiv
Top 0.2%
6.5%

MicroRNAs are short noncoding RNAs that regulate gene expression and are commonly profiled by small RNA sequencing (miRNA-seq). Despite the widespread use of miRNA-seq, datasets are often analyzed with RNA-seq methods such as DESeq2 or edgeR, which do not take into account the specific characteristics of miRNA-seq data. Here, we present a benchmark study of normalization and differential expression approaches using a realistic ground-truth dataset. By mixing mouse RNA from two organs, we generated expression trends while capturing biological and technical variability. Using monotonicity across the dataset and expected fold changes from the mixture design, we assessed normalization and differential expression methods. Normalization benchmarking showed that within-sample scaling, particularly Reads Per Million (RPM), best preserved the expected monotonic trends, outperforming cross-sample methods such as TMM, rlog, and VST. These cross-sample approaches sometimes recovered apparent monotonicity among abundant miRNAs, but inspection of individual profiles suggested likely over-correction. Regarding differential expression, edgeR consistently ranked among the best-performing methods across several metrics, including log2 fold-change estimation, with performance comparable to miRNA-seq-specific tools such as miRglmm and NBSR. DESeq2, edgeR-v4, and limma-based approaches tended to systematically underestimate log2 fold changes. Applying a common RPM-based normalization substantially improved the performance of cross-sample methods, highlighting the strong influence of normalization on differential expression analysis. Overall, our findings support within-sample scaling methods such as RPM for normalization, and edgeR, miRglmm, or NBSR for differential expression. The dataset has been made publicly available, providing a valuable resource for objective method comparison and future miRNA-seq software development.
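
The within-sample scaling named in this abstract can be illustrated with a minimal sketch. RPM divides each miRNA's read count by the total reads in the same library and multiplies by one million. The function name `rpm_normalize` and the toy counts below are illustrative assumptions and are not taken from the preprint.

```python
def rpm_normalize(counts):
    """Scale the raw read counts of one sample to reads per million (RPM)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no mapped reads")
    return {name: count * 1_000_000 / total for name, count in counts.items()}


# Toy library with hypothetical miRNA names and counts, for illustration only.
sample = {"miR-A": 500, "miR-B": 300, "miR-C": 200}
rpm = rpm_normalize(sample)
```

Because the scaling uses only the sample's own library size, each sample is normalized independently of the others, which is what distinguishes it from cross-sample methods such as TMM.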

9
Viral non-coding RNA structure annotation and API-based data retrieval with Rfam and R2DT

Muston, P.; Triebel, S.; Nawrocki, E.; Ontiveros-Palacios, N.; Jandalala, I.; Sweeney, B.; Bateman, A.; Marz, M.; Petrov, A. I.; Madrigal, P.

2026-05-14 bioinformatics 10.64898/2026.05.10.724034 medRxiv
Top 0.2%
6.4%

Rfam is a comprehensive database of non-coding RNA (ncRNA) families providing curated sequence alignments, consensus secondary structures, and covariance models for thousands of RNA families. The database is essential for identifying structured non-coding RNAs in newly sequenced genomes and understanding RNA structure-function relationships. Here we present computational protocols for automated ncRNA annotation of viral genomes, and for programmatic interaction with Rfam through its RESTful API. We showcase genome-wide RNA structure visualization from a genome sequence and from a multiple sequence alignment by generating comprehensive 2D structure diagrams using newly developed features in R2DT. We also present practical examples for retrieving family metadata, downloading alignments, accessing secondary structures, and searching user sequences from the Rfam API. These methods enable researchers in virology and RNA biology to integrate Rfam data into custom bioinformatics pipelines, comparative analyses, and machine learning workflows.

10
Toward a probabilistic definition of chromatin accessible regions at the single-cell level

Sanchez-Escabias, E.; Rico, D.; Reyes, J. C.

2026-05-04 genomics 10.64898/2026.05.01.722232 medRxiv
Top 0.2%
6.3%

Understanding cis-regulatory elements (CREs) at the single-cell level is fundamental to deciphering transcriptional changes during development, cell differentiation, and homeostasis. Recent studies have shown that arbitrary peak-calling thresholds complicate data interpretation and cross-study comparisons. Furthermore, due to the inherent sparsity of single-nuclei ATAC-seq (snATAC-seq) data, distinguishing between truly inaccessible regions and technical dropouts remains challenging. Our analysis of snATAC-seq experiments performed in a well-established cell line suggests that the dichotomy between accessible (open) and inaccessible (closed) CREs is misleading. Thousands of accessible regions are present in a very small fraction of cells of the population but are repeatedly identified, suggesting that they have low accessibility or are only transiently accessible. However, depending on the detection threshold selected, they could be considered either genuine CREs or noise. To resolve this inconsistency, we propose a model where chromatin accessibility is treated as a continuum, defined by a probability of accessibility (Pa) for each accessible region across cell types and conditions. Through computational simulations, we demonstrate that snATAC-seq results can be explained by a simple "balls into bins" probability model, offering a theoretical framework for calculating Pa distributions from any snATAC-seq dataset. Furthermore, we examine how Pa distributions shift following activation of the TGFβ signaling pathway. This probabilistic approach removes the reliance on arbitrary thresholds and supports a more quantitative and dynamic understanding of accessible region function.
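
A generic "balls into bins" simulation gives a feel for the kind of model the abstract mentions. This is a sketch under stated assumptions, not the authors' implementation: each cell contributes a fixed number of reads (balls), each read lands in a region (bin) with a probability proportional to a weight, and the fraction of cells with at least one read in a region stands in for that region's probability of accessibility. The function names, the weights, and the use of the detected fraction as a proxy for Pa are all hypothetical.

```python
import random


def simulate_detection(weights, n_cells, reads_per_cell, seed=0):
    """Throw `reads_per_cell` balls into weighted bins for each cell.

    Returns, for each region, the fraction of cells with at least one read.
    """
    rng = random.Random(seed)
    regions = list(range(len(weights)))
    hits = [0] * len(weights)
    for _ in range(n_cells):
        # Regions that received one or more reads in this cell.
        touched = set(rng.choices(regions, weights=weights, k=reads_per_cell))
        for region in touched:
            hits[region] += 1
    return [h / n_cells for h in hits]


def expected_detection(weight_fraction, reads_per_cell):
    """Closed-form chance that a region receives at least one read in a cell."""
    return 1.0 - (1.0 - weight_fraction) ** reads_per_cell


# Hypothetical region weights (they sum to 1) and sequencing depth.
weights = [0.5, 0.3, 0.2]
observed = simulate_detection(weights, n_cells=20000, reads_per_cell=3)
```

Under these assumptions a region with a small weight is detected in only a small fraction of cells, yet it is detected reproducibly across runs, which mirrors the abstract's point that rarely observed regions need not be noise.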

11
De novo protein discovery in non-model organisms

Ali, A.

2026-05-13 bioinformatics 10.64898/2026.05.08.723910 medRxiv
Top 0.3%
6.2%

We developed plant (Parallel Annotation of Transcriptomes), a de novo method that can potentially compare RNA-seq data of any two species without a reference genome. plant is conceptually similar to chromatography. In the same way a complex mixture is filtered to isolate its individual components, we applied a computational method to identify, annotate, and quantify components across transcriptomes. The comparison points are universal protein domain annotations rather than species-specific genes, as would be the case for a differential gene expression analysis. We looked at several Selaginella species via the 1000 Plant transcriptomes initiative (1KP) where RNA-seq data for various plant species have been made publicly available. The raw reads were assembled via Trinity. The assembled transcripts were then searched against the Pfam protein domain database via InterProScan. The assembled transcripts were also quantified via kallisto. By merging these two aspects, we were able to see how often a particular protein domain - a predicted protein structure - is expressed. These quantified annotations of protein domains are comparable across species, assuming a relatively short evolutionary distance. We were also able to identify the presence of species-specific protein domains and trace each annotation back to the gene. A bubble plot was created to visualize the distributions of Pfam annotations across species as well as GO terms.
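
The merging step described in this abstract, combining per-transcript domain annotations with per-transcript abundance, can be sketched as follows. This is an illustrative assumption about how such a merge might look, not the tool's actual code: abundances are summed over every transcript carrying a domain, and a transcript with several domains contributes to each of them. The function name `domain_expression`, the transcript IDs, the TPM values, and the use of the accessions as placeholders are hypothetical.

```python
from collections import defaultdict


def domain_expression(transcript_domains, transcript_tpm):
    """Sum transcript abundance (TPM) over each annotated protein domain."""
    totals = defaultdict(float)
    for transcript, domains in transcript_domains.items():
        # Transcripts missing from the quantification count as zero.
        tpm = transcript_tpm.get(transcript, 0.0)
        # set() avoids double counting a domain repeated within one transcript.
        for domain in set(domains):
            totals[domain] += tpm
    return dict(totals)


# Hypothetical inputs: domain hits per transcript and TPM per transcript.
annotations = {
    "tx1": ["PF00069"],
    "tx2": ["PF00069", "PF07714"],
    "tx3": ["PF00010"],
}
abundance = {"tx1": 10.0, "tx2": 5.0, "tx3": 2.5}
per_domain = domain_expression(annotations, abundance)
```

Keying the result by domain accession rather than by gene is what makes the table comparable between species that share no gene identifiers.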

12
On the state of protein function prediction: a report on the fourth CAFA challenge

Ramola, R.; De Paolis Klauza, M. C.; Piovesan, D.; Peng, Y.; Joshi, P.; Mehdiabadi, M.; Quaglia, F.; Pancsa, R.; Chemes, L. B.; Ahmadi, M.; Ahn, H.; Altenhoff, A. M.; Asgari, E.; Aspromonte, M. C.; Atalay, V.; Babbi, G.; Baldazzi, D.; Barot, M. M.; Ben-Hur, A.; Benso, A.; Berenberg, D.; Bjorne, J.; Boecker, F.; Boldi, P.; Bonello, J.; Bordin, N.; Borole, P.; Ebrahimpour Boroojeny, A.; Cao, R.; Di Carlo, S.; Casadio, R.; Casiraghi, E.; Chang, J.-M.; Chen, C.; Chen, T.-M.; Cheng, J.; Chiu, S.; Dalkiran, A.; Davidovic, R. S.; Dessimoz, C.; Diao, R.; Djeddi, W. E.; Dogan, T.; Flannery, S. T.; Font

2026-05-11 bioinformatics 10.64898/2026.05.06.722942 medRxiv
Top 0.3%
4.9%

Background: The Critical Assessment of Functional Annotation (CAFA) is a community effort held to understand the field of computational protein function prediction. Every three years, since 2010, the organizers initiate an experiment to collect function predictions on a large set of proteins and then evaluate the performance of predicting methods on a subset of proteins that have accumulated experimental annotations between the submission deadline and the evaluation time. CAFA provides an independent and rigorous assessment of the current state of the art, thus leveling the playing field, highlighting successes, revealing bottlenecks, and offering a forum for the exchange of ideas in protein science. Here, we report the results of the fourth CAFA experiment (CAFA4). Results: CAFA4 featured the participation of 148 methods from 70 research groups on a total of 46,205 unique proteins over a 5-year annotation accumulation phase, the longest in any CAFA. In a comparison across CAFA2-CAFA4 methods, the prediction of Gene Ontology (GO) terms has clearly improved across all three GO aspects and traditional evaluation settings. While not achieving the first rank, several CAFA2 and CAFA3 methods featured in the top ten methods in many evaluations, suggesting that earlier methods still hold relevance. The performance is weaker in the newly introduced "partial knowledge" evaluation category (proteins with experimental annotations before submission deadline that gained additional annotations in the same GO aspect during the annotation accumulation phase), highlighting the need for a new class of methods. The rankings of the methods were stable over the years in traditional evaluation settings, but less so in the new partial knowledge evaluation. Overall, the field continues to progress with some influx of new participants. Sustained efforts will be necessary to substantially advance it.

13
Combining amino acid frequency and 1D convolutional neural network embeddings for the identification of protein-protein interactions using a random forest classifier

Sindhi, N. A.; Pawar, N.; Dixson, J.; Garcia, D.

2026-05-18 bioinformatics 10.64898/2026.05.15.725340 medRxiv
Top 0.3%
4.9%

Predicting protein-protein interactions is a fundamental problem in molecular biology. Experimental approaches for identifying protein-protein interactions are time-consuming and labor-intensive, motivating the development of efficient computational alternatives, including machine learning-based methods. However, conventional machine learning methods often rely on manually engineered features that require substantial domain expertise. In this study, we propose a two-stage framework to address these limitations. In the first stage, a one-dimensional convolutional neural network autoencoder is used to automatically learn latent representations from protein sequences. The quality of these features is evaluated through reconstruction error, reflecting how accurately the model reconstructs the original sequence. In the second stage, these learned features are combined with amino acid frequency-based features to form a hybrid feature set for predicting protein-protein interactions. A systematic comparison is performed between models trained on frequency features alone and those using a hybrid representation. The comparison showed that incorporating one-dimensional convolutional neural network-derived latent features improved the models' performance in predicting protein-protein interactions. The dataset was split into training, validation, and test sets. Nested cross-validation was employed, with inner loops for hyperparameter tuning and outer loops for model selection. The random forest classifier achieved the best performance, with a mean receiver operating characteristic-area under curve of 0.91 and a test F1-score of 0.87. These results highlight the effectiveness of integrating deep feature learning with ensemble methods for predicting protein-protein interactions and build upon previous work focused on this fundamental problem. Author Summary: Protein-protein interactions are fundamental in all biological processes. 
However, predicting these interactions is a key problem in molecular biology. Computational approaches have been tested to address this problem. We applied a mix of machine learning and deep learning to gain insight into the qualities of proteins that engage in interaction. First, we trained a deep learning model, which automatically learned the primary sequence and characters related thereto, reducing bias in the actual prediction process. We combined these features, or latent representations, with amino acid frequency features of protein sequences, and called the two together "hybrid features." Then we performed a systematic comparison of amino acid frequency features-only with hybrid features, among four different machine learning classifiers. Our results suggest that the random forest classifier performed best among all four classifiers at predicting interactions between proteins. We propose that this approach could be used to improve efficiency in testing protein-protein interactions at the bench and may have applications to other biologically relevant molecular interactions.

14
TreeGazer: Prospecting Protein Sequence-Function Landscapes via Phylogenetic Structure

Porras, S. A.; Davis, S. J.; Paredes Trujillo, O. D.; Diep, P.; Schenk, G.; Boden, M.

2026-05-17 bioinformatics 10.64898/2026.05.14.725301 medRxiv
Top 0.3%
4.8%

Building diverse and informative protein sequence datasets is critical for understanding how function varies across sequence space. Because only a small fraction of sequences in a dataset can typically be experimentally characterised, strategies for selecting which sequences to characterise should maximise the information gained from each experiment. Here, we present TreeGazer, a phylogeny-informed framework that combines Bayesian optimisation with the topology of a tree to guide sequence selection. TreeGazer balances exploitation of sequences predicted to exhibit favourable properties against exploration of regions of higher model uncertainty. Unlike existing approaches that apply Bayesian optimisation for sequence selection, TreeGazer does not rely on black-box models and instead uses latent representations of property distributions that are directly tied to phylogenetic structure. Modelling properties in this way enables biologically interpretable predictions and uncertainty estimates. Across two simulated selection campaigns, TreeGazer consistently selected sequences that produced datasets more representative of the underlying property distribution than alternative strategies that used protein language models. TreeGazer also performed effectively in low-data settings, where tree-guided selection enabled accurate identification of functional transitions across clades. TreeGazer can be run on conventional laptop computers while still providing equivalent or superior performance to embedding-based approaches. These results demonstrate that phylogenetic structure is a powerful and underutilised prior for guiding informative sequence selection.

15
Classic machine learning on top of multiple position weight matrices improves genomic prediction of transcription factor binding sites

Kravchenko, P.; Vorontsov, I. E.; Makeev, V. J.; Kulakovskiy, I. V.; Penzar, D. D.

2026-05-14 bioinformatics 10.64898/2026.05.12.724515 medRxiv
Top 0.4%
4.8%

Motivation: DNA motifs recognised by transcription factors are typically represented as position weight matrices (PWMs), assuming independent contributions of individual nucleotides to protein binding specificity. Many alternative models accounting for correlations of positional contributions have been introduced in the past decades. However, performance gains have generally not outweighed the advantages of simplicity, interpretability, and practical applicability of PWMs with the well-established codebase. Existing software tools and motif databases provide multiple non-identical PWMs for the same transcription factor or even for the same dataset. It remains a practical question whether these PWMs can be effectively combined into a single improved model. Results: Here we describe ArChIPelago (https://github.com/autosome-ru/ArChIPelago), a computational framework that combines multiple PWMs into a joint model using classic machine learning techniques, from linear regression to ensembles of decision trees. We show that such a combination improves prediction of transcription factor binding sites in genomic sequences. With a diverse collection of 704 ChIP-Seq datasets spanning 36 orthologous human and mouse transcription factors of diverse structural families, we show that ArChIPelago consistently outperforms the best available individual mono- and dinucleotide PWMs as well as sparse local inhomogeneous mixture models. Furthermore, using both human and mouse data, we demonstrate that PWM ensembles are capable of making reliable cross-species predictions.
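
The general idea of combining several PWMs into one score can be sketched in a few lines. This is a simplified illustration, not ArChIPelago's code or API: each PWM is reduced to its best-window score on a sequence, and the scores are combined linearly. The function names, the two toy matrices, and the hand-picked weights are hypothetical; in a real model the weights would be fitted to data, and both strands would normally be scanned, whereas this sketch scans the forward strand only.

```python
def best_pwm_score(seq, pwm):
    """Best additive score of `pwm` over all windows of `seq`.

    `pwm` is a list with one dict per motif position, mapping base -> weight,
    so positions contribute independently, as a PWM assumes.
    """
    width = len(pwm)
    if len(seq) < width:
        raise ValueError("sequence shorter than motif")
    return max(
        sum(column[base] for column, base in zip(pwm, seq[i:i + width]))
        for i in range(len(seq) - width + 1)
    )


def combined_score(seq, pwms, weights, bias=0.0):
    """Linear combination of the per-PWM best scores (one feature per PWM)."""
    features = [best_pwm_score(seq, pwm) for pwm in pwms]
    return bias + sum(w * f for w, f in zip(weights, features))


def toy_column(preferred):
    """Hypothetical column: +1 for the preferred base, -1 for the others."""
    return {base: (1.0 if base == preferred else -1.0) for base in "ACGT"}


pwm_ac = [toy_column("A"), toy_column("C")]  # toy motif "AC"
pwm_cg = [toy_column("C"), toy_column("G")]  # toy motif "CG"
score = combined_score("TACGT", [pwm_ac, pwm_cg], weights=[0.5, 0.25])
```

Using the per-PWM scores as features keeps each input interpretable, and the linear combiner could be swapped for a tree ensemble without changing the feature extraction.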

16
AbSolution: interactive exploration of sequence-derived features in AIRR-seq repertoires

Garcia-Valiente, R.; Triantafyllou, C.; van Schaik, B.; Jongejan, A.; Pollastro, S.; Anang, D. C.; Guikema, J. E.; de Vries, N.; Hoefsloot, H. C.; van Kampen, A. H. C.

2026-05-22 bioinformatics 10.64898/2026.05.20.726477 medRxiv
Top 0.4%
4.8%

High-throughput sequencing of B-cell and T-cell immune receptor repertoires provides unprecedented insight into adaptive immune responses. The data produced are structured by clonal relationships and somatic mutation signatures, and yield extremely rich information in sequence-derived features, including physicochemical properties and compositional patterns. However, integrated analysis across datasets, conditions, and time points remains challenging: current analytical tools typically focus on a limited set of features within individual repertoires and do not support multivariable comparisons that capture the diversity and variability of those repertoires. Here we present AbSolution, a user-friendly and flexible interactive application for comprehensive exploration of immune repertoires and their sequence-based properties. AbSolution enables multiscale analysis of thousands of sequence-derived features across receptor regions, while accounting for V(D)J usage, clonal composition and experimental groupings. We demonstrate its utility by identifying distinct sequence-based profiles associated with dominant (highly abundant) and non-dominant B-cell clones in peripheral blood BCR repertoires from patients with idiopathic inflammatory myopathies, and with antigen-responsive T-cell populations over time in a longitudinal in vitro antigen-stimulation dataset. Through interactive, interlinked visualizations, statistical feature selection and multi-sample comparisons, AbSolution facilitates integrated feature profiling that supports the interpretation of immune selection processes and enables systematic analysis of complex repertoire datasets.

17
SigBridgeR: An Integrative Framework and Toolkit for Comprehensive Screening and Benchmarking of Phenotype-Associated Cell Subpopulations in Single-Cell Transcriptomics

Yang, Y.; Yan, Z.; Qian, H.; Du, L.; Wang, C.; Peng, Y.; Bu, X.; Zhou, J.-G.; Wang, S.

2026-05-12 bioinformatics 10.64898/2026.05.08.723458 medRxiv
Top 0.4%
4.7%

Single-cell RNA sequencing has revolutionized our understanding of cellular heterogeneity, yet linking specific cell subpopulations to clinically relevant phenotypes remains a persistent challenge. Although multiple computational methods have been developed to bridge this gap, they are typically implemented as standalone packages with heterogeneous preprocessing pipelines, incompatible parameter conventions, and divergent output formats, thereby hindering rigorous cross-method benchmarking and reproducible multi-method workflows. Here, we present SigBridgeR, an extensible R framework and comprehensive toolkit that currently unifies eight state-of-the-art phenotype-associated cell screening algorithms within consistent workflows. We conducted a systematic benchmarking study across four cancer types (HER2-positive breast cancer, triple-negative breast cancer, lung adenocarcinoma, and ovarian cancer) using both binary phenotypes and patient survival endpoints. Our evaluation incorporated positive and negative control assessments based on differentially expressed genes and randomly selected marker panels, alongside quantitative accuracy comparisons using ground-truth cell labels. Building upon these insights, SigBridgeR provides standardized preprocessing for scRNA-seq and bulk transcriptomic data, unified algorithmic interfaces through a registry-based architecture, ensemble analysis via weighted voting, and comprehensive visualization utilities for multi-method comparison. By lowering technical barriers and promoting methodological standardization, SigBridgeR facilitates reliable discovery of phenotype-relevant cell subpopulations and enhances the translational potential of single-cell omics research.
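Ensemble analysis via weighted voting, as mentioned in this abstract, can be sketched in a few lines. SigBridgeR itself is an R framework; the Python below is only a language-neutral illustration, and the method names, labels, weights, and the `weighted_vote` function are hypothetical rather than part of its API.

```python
from collections import defaultdict
from typing import Dict, List


def weighted_vote(
    calls: Dict[str, List[str]],
    weights: Dict[str, float],
) -> List[str]:
    """Combine per-cell labels from several methods by weighted voting.

    calls maps a method name to one label per cell (same cell order for all
    methods); weights maps a method name to its voting weight.
    """
    lengths = {len(labels) for labels in calls.values()}
    if len(lengths) != 1:
        raise ValueError("every method must label the same number of cells")
    n_cells = lengths.pop()

    consensus = []
    for i in range(n_cells):
        tally: Dict[str, float] = defaultdict(float)
        for method, labels in calls.items():
            tally[labels[i]] += weights[method]
        # Iterating labels in sorted order makes ties resolve alphabetically.
        consensus.append(max(sorted(tally), key=lambda label: tally[label]))
    return consensus


# Hypothetical calls from three screening methods for three cells.
calls = {
    "method_1": ["pos", "neg", "pos"],
    "method_2": ["neg", "neg", "pos"],
    "method_3": ["neg", "pos", "pos"],
}
weights = {"method_1": 2.0, "method_2": 1.0, "method_3": 0.5}

consensus = weighted_vote(calls, weights)
```

In the first cell, the single high-weight method outvotes the two lower-weight methods, which is the behaviour that distinguishes weighted voting from a simple majority.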

18
DamageFormer: a damage-aware multimodal deep learning framework for DNA lesion identification from nanopore sequencing

Yang, Q.; Li, L.; Ma, Q.; Yin, R.

2026-05-18 genomics 10.64898/2026.05.14.725245 medRxiv
Top 0.4%
4.3%

Background: DNA lesions arise from endogenous metabolism and environmental exposure and are major drivers of mutagenesis, aging, and cancer development. However, mapping DNA damage at nucleotide resolution remains technically challenging. Nanopore sequencing enables direct detection of chemical perturbations through alterations in ionic current signals. Despite this potential, existing computational approaches remain limited in their capacity to generalize across diverse lesion types and to effectively integrate nucleotide sequence context with raw signal information for accurate detection and localization. Results: We present DamageFormer, a multimodal deep learning framework for detection and localization of DNA lesions using native nanopore sequencing data. Central to this framework is LesionBERT, a damage-aware genomic foundation model built upon DNABERT-2 and enhanced with lesion-focused reconstruction objectives to improve representation of chemically modified bases. DamageFormer integrates LesionBERT with a neural signal model through an adaptive gating mechanism, enabling dynamic weighting of sequence context and nanopore signal evidence. The model is trained using a joint objective that combines prediction, localization, and contrastive alignment losses to promote cross-modal coherence and spatial precision. On an oxidative DNA damage benchmark comprising paired sequence and signal data, DamageFormer achieved an AUROC of 0.99997 for lesion detection and a mean absolute localization error of 0.00439, consistently outperforming state-of-the-art baselines. Model interpretation analyses revealed context-dependent modality weighting that adapts to variation in signal quality and sequence ambiguity. The proposed framework further generalizes to chemically distinct guanine lesions not observed during training, demonstrating its robustness and transferability to unseen damage types.
Conclusions: Damage-aware biological language modeling combined with adaptive multimodal fusion enables accurate and interpretable identification of DNA lesions from nanopore sequencing data. This framework provides a scalable approach for characterizing genome-wide damage landscapes and illustrates how chemical DNA information can be systematically incorporated into genomic language models. The source code and pretrained models of this work are available at: https://github.com/UF-HOBIYin-Lab/DamageFormer.
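The adaptive gating mechanism mentioned in this abstract, which weights sequence context against signal evidence, can be illustrated with a scalar sigmoid gate over two embedding vectors. This is a generic gated-fusion sketch, not DamageFormer's architecture: the scalar gate, the function `gated_fusion`, and all values are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    h_seq: Sequence[float],
    h_sig: Sequence[float],
    gate_weights: Sequence[float],
    gate_bias: float = 0.0,
) -> List[float]:
    """Fuse two embeddings as g * h_seq + (1 - g) * h_sig.

    The gate g = sigmoid(w . [h_seq; h_sig] + b) is computed from both
    inputs, so the mixing ratio adapts to each example.
    """
    if len(h_seq) != len(h_sig):
        raise ValueError("embeddings must have the same dimension")
    concat = list(h_seq) + list(h_sig)
    if len(gate_weights) != len(concat):
        raise ValueError("gate_weights must match the concatenated length")
    gate = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    return [gate * a + (1.0 - gate) * b for a, b in zip(h_seq, h_sig)]


# Hypothetical 2-dimensional embeddings for one read position.
h_seq = [1.0, 3.0]  # sequence-model embedding
h_sig = [3.0, 1.0]  # signal-model embedding

# Zero weights and zero bias give g = 0.5, an equal mix of both modalities.
balanced = gated_fusion(h_seq, h_sig, gate_weights=[0.0] * 4)
# A large positive bias pushes g toward 1, so the sequence embedding dominates.
seq_heavy = gated_fusion(h_seq, h_sig, gate_weights=[0.0] * 4, gate_bias=20.0)
```

In a trained model the gate weights would be learned, and a per-dimension gate vector is a common alternative to the single scalar used here.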

19
L3R-seq: A long-read 3'RACE approach for deep quantitative analysis of RNA processing

Mamiya, A.; Takenaka, M.; Sugiyama, M.

2026-05-23 bioinformatics 10.64898/2026.05.20.726719 medRxiv
Top 0.4%
4.3%

Long-read sequencing technologies offer the potential to capture multiple RNA processing events within single molecules, but standard protocols suffer from quantification biases and sequencing errors that limit their utility for precise analysis. Here, we describe Long-read 3' RACE-seq (L3R-seq), a targeted long-read sequencing method that ligates a unique molecular identifier (UMI)-containing adapter to the 3' end of RNA molecules prior to reverse transcription and PCR amplification. By grouping cDNA reads sharing the same UMI and generating a consensus sequence for each original RNA molecule, L3R-seq corrects random sequencing errors and mitigates PCR-duplicate-driven quantification biases. The method enables simultaneous, per-molecule analysis of RNA editing, 3' end cleavage and trimming, and polyadenylation status. Along with a step-by-step protocol for library preparation and sequencing with the Oxford Nanopore Technologies (ONT) platform, we describe an accompanying bioinformatic pipeline for consensus generation and extraction of RNA features. As an example, we apply L3R-seq to the mitochondrial mRNA ccmC from Arabidopsis thaliana, a transcript subject to extensive C-to-U editing and non-canonical 3' end processing. The workflow is readily adaptable to other RNA targets and is transferable to the Pacific Biosciences (PacBio) platform.
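The UMI grouping and consensus step described in this abstract can be sketched as a per-position majority vote. This is a simplified illustration, not the L3R-seq pipeline: it assumes reads within a UMI group are already aligned to equal length, whereas real nanopore reads contain insertions and deletions and need alignment first. The function name `umi_consensus` and the toy reads are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def umi_consensus(reads: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Collapse (umi, sequence) pairs into one consensus sequence per UMI.

    Assumes all reads sharing a UMI have the same length (pre-aligned).
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for umi, seq in reads:
        groups[umi].append(seq)

    consensus: Dict[str, str] = {}
    for umi, seqs in groups.items():
        if len({len(s) for s in seqs}) != 1:
            raise ValueError(f"reads for UMI {umi} differ in length")
        # zip(*seqs) yields one column of bases per position; the most
        # frequent base in each column becomes the consensus base.
        consensus[umi] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus


# Toy input: five reads from two original molecules, one read with an error.
reads = [
    ("AAGT", "ACGTAC"),
    ("AAGT", "ACGTAC"),
    ("AAGT", "ACTTAC"),  # single sequencing error at position 3
    ("CCTA", "ACGTTC"),
    ("CCTA", "ACGTTC"),
]

molecules = umi_consensus(reads)
```

Counting the keys of `molecules` rather than the raw reads gives one count per original RNA molecule, which is how UMI collapsing reduces PCR-duplicate bias.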

20
A Fractal-Dimension Framework for Quantifying Self-Similarity in Chromatin Folding

El-Yaagoubi, A.; Balubaid, A. O.; Chung, M. K.; Tegner, J.; Ombao, H.

2026-05-09 bioinformatics 10.64898/2026.05.06.723123 medRxiv
Top 0.4%
4.3%

The three-dimensional folding of DNA is essential for genome function, but its organization remains difficult to summarize quantitatively across genomic scales. Here, we study DNA folding from Hi-C contact data using a network-based notion of fractal dimension. In this representation, genomic loci are treated as nodes, and observed Hi-C contacts define weighted edges, so that frequently interacting loci are closer in the resulting network. We then estimate fractal dimension using two complementary graph-based methods: the correlation dimension and the sandbox dimension. Validation on synthetic networks shows that the proposed estimators detect clear scaling behavior in hierarchical fractal-like networks, while distinguishing them from networks with local clustering but no stable multiscale self-similarity. Applied to intrachromosomal Hi-C data from the IMR90 human cell line, the method reveals approximate linear scaling regimes on log-log plots, suggesting fractal-like organization in chromatin contact networks. At the chromosome level, estimated fractal dimension tends to increase with chromosome size: larger chromosomes often have dimensions closer to 3, consistent with more compact and space-filling organization, whereas shorter chromosomes tend to have lower dimensions, closer to 1, consistent with simpler and more open folding patterns. A sliding-window analysis at 5 kb resolution further shows that fractal organization varies substantially along chromosomes rather than remaining uniform across genomic position. These results suggest that graph-based fractal dimension provides an interpretable summary of DNA folding complexity at both global and local scales. More broadly, the proposed framework offers a quantitative way to study multiscale genome organization from Hi-C data using tools from network geometry.
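The correlation dimension named in this abstract can be estimated on a graph as the slope of log C(r) against log r, where C(r) is the fraction of ordered node pairs within shortest-path distance r. The sketch below is a simplified illustration and not the authors' estimator: it uses unweighted hop distances instead of Hi-C-weighted edges, and the helper names and test graphs are hypothetical.

```python
import math
from collections import deque
from typing import Dict, Hashable, List, Sequence

Graph = Dict[Hashable, List[Hashable]]


def bfs_distances(adj: Graph, source: Hashable) -> Dict[Hashable, int]:
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def correlation_dimension(adj: Graph, radii: Sequence[int]) -> float:
    """Least-squares slope of log C(r) versus log r over the given radii."""
    n = len(adj)
    counts = {r: 0 for r in radii}
    for source in adj:
        for node, d in bfs_distances(adj, source).items():
            if node != source:
                for r in radii:
                    if d <= r:
                        counts[r] += 1
    xs = [math.log(r) for r in radii]
    ys = [math.log(counts[r] / (n * (n - 1))) for r in radii]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def ring(n: int) -> Graph:
    """Cycle graph: a one-dimensional lattice with periodic boundary."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def torus(side: int) -> Graph:
    """Square lattice with periodic boundary: a two-dimensional reference."""
    return {
        (i, j): [
            ((i + 1) % side, j), ((i - 1) % side, j),
            (i, (j + 1) % side), (i, (j - 1) % side),
        ]
        for i in range(side) for j in range(side)
    }


radii = [1, 2, 3, 4, 5]
dim_ring = correlation_dimension(ring(40), radii)
dim_torus = correlation_dimension(torus(12), radii)
```

On the ring, C(r) grows linearly in r and the estimate is 1. On the small torus the estimate lies between 1.5 and 2 because the radii are short; it approaches 2 only as the fitted range extends to larger radii on a larger lattice.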